{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 使用Embedding API\n",
    "注：为了方便embedding api调用，应将密钥填入llm_universe下的.env文件，代码将自动读取并加载环境变量。\n",
    "## 一、使用OpenAI API\n",
    "GPT有封装好的接口，我们简单封装即可。目前GPT embedding mode有三种，性能如下所示：\n",
    "|模型 | 每美元页数 | [MTEB](https://github.com/embeddings-benchmark/mteb)得分 | [MIRACL](https://github.com/project-miracl/miracl)得分|\n",
    "| --- | --- | --- | --- |\n",
    "|text-embedding-3-large|9,615|64.6|54.9|\n",
    "|text-embedding-3-small|62,500|62.3|44.0|\n",
    "|text-embedding-ada-002|12,500|61.0|31.4|\n",
    "* MTEB得分为embedding model分类、聚类、配对等八个任务的平均得分。\n",
    "* MIRACL得分为embedding model在检索任务上的平均得分。  \n",
    "\n",
    "从以上三个embedding model我们可以看出`text-embedding-3-large`有最好的性能和最贵的价格，当我们搭建的应用需要更好的表现且成本充足的情况下可以使用；`text-embedding-3-small`有着较好的性能跟价格，当我们预算有限时可以选择该模型；而`text-embedding-ada-002`是OpenAI上一代的模型，无论在性能还是价格都不如及前两者，因此不推荐使用。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "from openai import OpenAI\n",
    "from dotenv import load_dotenv, find_dotenv\n",
    "\n",
    "\n",
    "# 读取本地/项目的环境变量。\n",
    "# find_dotenv()寻找并定位.env文件的路径\n",
    "# load_dotenv()读取该.env文件，并将其中的环境变量加载到当前的运行环境中  \n",
    "# 如果你设置的是全局的环境变量，这行代码则没有任何作用。\n",
    "_ = load_dotenv(find_dotenv())\n",
    "\n",
    "# 如果你需要通过代理端口访问，你需要如下配置\n",
    "# os.environ['HTTPS_PROXY'] = 'http://127.0.0.1:7890'\n",
    "# os.environ[\"HTTP_PROXY\"] = 'http://127.0.0.1:7890'\n",
    "\n",
    "def openai_embedding(text: str, model: str=None):\n",
    "    # 获取环境变量 OPENAI_API_KEY\n",
    "    api_key=os.environ['OPENAI_API_KEY']\n",
    "    client = OpenAI(api_key=api_key)\n",
    "\n",
    "    # embedding model：'text-embedding-3-small', 'text-embedding-3-large', 'text-embedding-ada-002'\n",
    "    if model == None:\n",
    "        model=\"text-embedding-3-small\"\n",
    "\n",
    "    response = client.embeddings.create(\n",
    "        input=text,\n",
    "        model=model\n",
    "    )\n",
    "    return response\n",
    "\n",
    "response = openai_embedding(text='要生成 embedding 的输入文本，字符串形式。')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "API返回的数据为`json`格式，除`object`向量类型外还有存放数据的`data`、embedding model 型号`model`以及本次 token 使用情况`usage`等数据，具体如下所示：\n",
    "```json\n",
    "{\n",
    "  \"object\": \"list\",\n",
    "  \"data\": [\n",
    "    {\n",
    "      \"object\": \"embedding\",\n",
    "      \"index\": 0,\n",
    "      \"embedding\": [\n",
    "        -0.006929283495992422,\n",
    "        ... (省略)\n",
    "        -4.547132266452536e-05,\n",
    "      ],\n",
    "    }\n",
    "  ],\n",
    "  \"model\": \"text-embedding-3-small\",\n",
    "  \"usage\": {\n",
    "    \"prompt_tokens\": 5,\n",
    "    \"total_tokens\": 5\n",
    "  }\n",
    "}\n",
    "```\n",
    "我们可以调用response的object来获取embedding的类型。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "返回的embedding类型为：list\n"
     ]
    }
   ],
   "source": [
    "print(f'返回的embedding类型为：{response.object}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "embedding存放在data中，我们可以查看embedding的长度及生成的embedding。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "embedding长度为：1536\n",
      "embedding（前10）为：[0.03884002938866615, 0.013516489416360855, -0.0024250170681625605, -0.01655769906938076, 0.024130908772349358, -0.017382603138685226, 0.04206013306975365, 0.011498954147100449, -0.028245486319065094, -0.00674333656206727]\n"
     ]
    }
   ],
   "source": [
    "print(f'embedding长度为：{len(response.data[0].embedding)}')\n",
    "print(f'embedding（前10）为：{response.data[0].embedding[:10]}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "我们也可以查看此次embedding的模型及token使用情况。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "本次embedding model为：text-embedding-3-small\n",
      "本次token使用情况为：Usage(prompt_tokens=12, total_tokens=12)\n"
     ]
    }
   ],
   "source": [
    "print(f'本次embedding model为：{response.model}')\n",
    "print(f'本次token使用情况为：{response.usage}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 二、使用文心千帆API\n",
    "Embedding-V1是基于百度文心大模型技术的文本表示模型，Access token为调用接口的凭证，使用Embedding-V1时应先凭API Key、Secret Key获取Access token，再通过Access token调用接口来embedding text。同时千帆大模型平台还支持bge-large-zh等embedding model。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "import requests\n",
    "import json\n",
    "\n",
    "def wenxin_embedding(text: str):\n",
    "    # 获取环境变量 wenxin_api_key、wenxin_secret_key\n",
    "    api_key = os.environ['QIANFAN_AK']\n",
    "    secret_key = os.environ['QIANFAN_SK']\n",
    "\n",
    "    # 使用API Key、Secret Key向https://aip.baidubce.com/oauth/2.0/token 获取Access token\n",
    "    url = \"https://aip.baidubce.com/oauth/2.0/token?grant_type=client_credentials&client_id={0}&client_secret={1}\".format(api_key, secret_key)\n",
    "    payload = json.dumps(\"\")\n",
    "    headers = {\n",
    "        'Content-Type': 'application/json',\n",
    "        'Accept': 'application/json'\n",
    "    }\n",
    "    response = requests.request(\"POST\", url, headers=headers, data=payload)\n",
    "    \n",
    "    # 通过获取的Access token 来embedding text\n",
    "    url = \"https://aip.baidubce.com/rpc/2.0/ai_custom/v1/wenxinworkshop/embeddings/embedding-v1?access_token=\" + str(response.json().get(\"access_token\"))\n",
    "    input = []\n",
    "    input.append(text)\n",
    "    payload = json.dumps({\n",
    "        \"input\": input\n",
    "    })\n",
    "    headers = {\n",
    "        'Content-Type': 'application/json'\n",
    "    }\n",
    "\n",
    "    response = requests.request(\"POST\", url, headers=headers, data=payload)\n",
    "\n",
    "    return json.loads(response.text)\n",
    "# text应为List(string)\n",
    "text = \"要生成 embedding 的输入文本，字符串形式。\"\n",
    "response = wenxin_embedding(text=text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Embedding-V1每次embedding除了有单独的id外，还有时间戳记录embedding的时间。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "本次embedding id为：as-hvbgfuk29u\n",
      "本次embedding产生时间戳为：1711435238\n"
     ]
    }
   ],
   "source": [
    "print('本次embedding id为：{}'.format(response['id']))\n",
    "print('本次embedding产生时间戳为：{}'.format(response['created']))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "同样的我们也可以从response中获取embedding的类型和embedding。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "返回的embedding类型为:embedding_list\n",
      "embedding长度为：384\n",
      "embedding（前10）为：[0.060567744076251984, 0.020958080887794495, 0.053234219551086426, 0.02243831567466259, -0.024505289271473885, -0.09820500761270523, 0.04375714063644409, -0.009092536754906178, -0.020122773945331573, 0.015808865427970886]\n"
     ]
    }
   ],
   "source": [
    "print('返回的embedding类型为:{}'.format(response['object']))\n",
    "print('embedding长度为：{}'.format(len(response['data'][0]['embedding'])))\n",
    "print('embedding（前10）为：{}'.format(response['data'][0]['embedding'][:10]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 三、使用讯飞星火API"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 尚未开放"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 四、使用智谱API\n",
    "智谱有封装好的SDK，我们调用即可。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "from zhipuai import ZhipuAI\n",
    "def zhipu_embedding(text: str):\n",
    "\n",
    "    api_key = os.environ['ZHIPUAI_API_KEY']\n",
    "    client = ZhipuAI(api_key=api_key)\n",
    "    response = client.embeddings.create(\n",
    "        model=\"embedding-2\",\n",
    "        input=text,\n",
    "    )\n",
    "    return response\n",
    "\n",
    "text = '要生成 embedding 的输入文本，字符串形式。'\n",
    "response = zhipu_embedding(text=text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "response为`zhipuai.types.embeddings.EmbeddingsResponded`类型，我们可以调用`object`、`data`、`model`、`usage`来查看response的embedding类型、embedding、embedding model及使用情况。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "response类型为：<class 'zhipuai.types.embeddings.EmbeddingsResponded'>\n",
      "embedding类型为：list\n",
      "生成embedding的model为：embedding-2\n",
      "生成的embedding长度为：1024\n",
      "embedding（前10）为: [0.017892399802803993, 0.0644201710820198, -0.009342825971543789, 0.02707476168870926, 0.004067837726324797, -0.05597858875989914, -0.04223804175853729, -0.03003198653459549, -0.016357755288481712, 0.06777040660381317]\n"
     ]
    }
   ],
   "source": [
    "print(f'response类型为：{type(response)}')\n",
    "print(f'embedding类型为：{response.object}')\n",
    "print(f'生成embedding的model为：{response.model}')\n",
    "print(f'生成的embedding长度为：{len(response.data[0].embedding)}')\n",
    "print(f'embedding（前10）为: {response.data[0].embedding[:10]}')"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "llm_universe_2.x",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.14"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
